Biostatistics (2012), 0, 0, pp. 1-21 
doi: 10. 1093 /biostatistics/LDD 



Fluctuation analysis with cell deaths 

BERNARD YCART 

Laboratoire Jean Kuntzmann 
Universite de Grenoble and CNRS UMR 5224 

51 rue des Mathematiques 38041 Grenoble cedex 9, France 
Bernard.Ycart@iniag.fr 

Summary 

The classical Luria-Delbriick model for fluctuation analysis is extended to the case where cells 
can either divide or die at the end of their generation time. This leads to a family of proba- 
bility distributions generalizing the Luria-Dclbriick family, and depending on 3 parameters: the 
expected number of mutations, the relative fitness of normal cells compared to mutants, and the 
death probability of mutants. The probabilistic treatment is similar to that of the classical case; 
simulation and computing algorithms are given. The estimation problem is discussed: if the death 
probability is known, the two other parameters can be reliably estimated. If the death probability 
is unknown, the model can be identified only for large samples. 
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1. Introduction 

Since it appeared more than 60 years ago, the Luria-Dclbriick distribution has been widely used 
as a model for the occurrence of mutants in cell cultures: see chap. II p. 59 of Kendall (1952) for an 
early review, and Zheng, 1999, 2010 for more recent ones. It is obtained as a limit when the initial 
number of cells and the experiment time are large, and the mutation probability is small. One of 
the underlying hypotheses is that cells only divide and never die, which is untrue in reality, even 
though the probability of death has been estimated to rather low values (Stewart and others, 
2005; Fontaine and others, 2008). A Markovian model of mutations including cell deaths was 
considered by Tan (1982), who proposed a computing algorithm for the distribution of mutants. 
Angerer (2001, section 3) also discussed the influence of cell death on the distribution of mutants. 
To the best of our knowledge no explicit representation of the distribution of mutants in a general 
model including cell deaths, and no quantitive study of the influence of deaths on the estimation 
of parameters have appeared so far. Our objective here is to extend the classical Luria-Delbriick 
model to the case where cells have a certain probability to die rather than divide, and provide 
statistical tools for the estimation of the parameters. 

Our hypotheses are the following: 

• at time a homogeneous population of n normal cells is given; 

• the generation time of any normal cell is a random variable with distribution G; 

• upon completion of the generation time of a normal cell: 

— with probability p one normal and one mutant cell are produced; 

— with probability q the cell dies out; 

— with probability 1 — p — q two normal cells are produced, 

• the generation time of any mutant cell is exponentially distributed with parameter ^*; 
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• upon completion of the generation time of a mutant cell: 

— with probability 6 the cell dies out; 

— with probability 1 — S two mutant cells are produced, 

• all random variables and events (division times, mutations, and deaths) are mutually inde- 



Considcr an initial (large) number n of normal cells. Assume that the mutation probability p is 
small, that the time t at which mutants arc counted is large, and that the asymptotics arc such 
that the expected number of mutations a before time t is non null and finite (precise hypotheses 
and statements will be given in section 2). Denote by i/ and fi the exponential growth rates of 
normal and mutant cells respectively, and hy p ^ v / ^ the relative fitness. It will be shown that 
the total number of mutants at time t approximately follows an integer valued distribution, whose 
probability generating function (PGF) is given by: 



The parameters are: 

1. a: the expected number of mutations 

2. p: the relative fitness of normal cells compared to mutants. 

3. 5: the death probability of a mutant cell. 

For (5 = 0, the Luria-Dclbriick distribution with parameters a and p, or LD(a, p), is obtained as a 
particular case. We propose to name "Luria-Delbriick with deaths", and denote by LDD(a,/9, (5), 



pendent. 



ga,p,5{z) = exp(Q;(/lp,5(z) - 1)) , 



(1.1) 



where: 




(1.2) 
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the distribution on integers with PGF ga.p.s- We propose a statistical study of the LDD(q;, p, 5) 
including: 

• fast simulation algorithm, 

• computation of probabilities, 

• asymptotic probabilities for large values, 

• point estimation of parameters, 

• confidence intervals. 

We have developed in R (R Development Core Team (2008)) a set of functions that perform the 
usual operations on the LDD distributions (simulation, distribution function and quantile com- 
putation), output estimates and confidence intervals. These functions have been made available 
online: http : //www. Ijk. imag.fr/membres/Bernard.Ycart/LD/. 

The paper is organized as follows. In section 2, the theoretical justification of the model 
is given. It is based on standard results from branching process theory. A simple probabilistic 
interpretation will be given. Section 3 describes the simulation and computation algorithms of the 
LDD(a, p, S): they are quite similar to those known for the LD{a, p) Zheng (2005). The estimation 
problem is addressed in section 4. The proposed method is based on generating function estimates, 
extending those proposed for the LD(q;,/9) in Hanion and Ycart (2012). Experimental results, 
both on simulated and real data are reported in section 5. 

2. ASYMPTOTICS FOR NUMBERS OF MUTANTS 

The results exposed in this section are applications of the general theory of supercritical age- 
dependent continuous time branching processes (or Bellman-Harris processes): sec Harris (1963, 
Chap. VI) and Athreya and Ney (1972. Chap. IV) as general references. They are similar to those 
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detailed in section 2 of Hamon and Ycart (2012), and we shall mainly develop those differences 
with the classical model, that arise from taking cell deaths into account. 

Firstly, consider the number of normal cells as a function of time. Recall that the generation 
times are assumed to be independent and identically distributed (i.i.d.) random variables, with 
common distribution G. Upon completion of a generation time, the number of (normal) offspring 
is: 

with probability q (death), 

1 with probability p (mutation) , 

2 with probability 1 — p — q (division) . 
Therefore the PGF of the offspring distribution is: 

q + pz + {l-p-q)z'^ , 

and its expectation (mean offspring number) is m = p + 2(1 — p — q). We shall assume that the 
mean offspring number is larger than 1, so the corresponding clones may survive with positive 
probability (supercritical case). If one normal cell is initially present, then either the population 
dies out or it grows exponentially. The probability that it dies out is the smallest positive root 
of the equation z — q + pz + {\— p — q)z'^ , which we shall denote by e: 

q 

l-p-q 

With probability 1 — e the clone does not die out, in which case it grows exponentially. The 
exponential growth rate (or Malthusian parameter) v is defined as the unique root of the equation: 

TO / e-''MG(s) = 1 . (2.3) 

Theorem 17.1 p. 142 of Harris (1963) gives a precise meaning to the expression "exponential 
growth". It states that: 

lim E[7Vt I iVo 1, A^t > 0] e""^* = C , 



6 B. YCART 

where Nt denotes the number of normal cells at time t. The limit is the proportionality coefficient 
of exponential growth. It is given by: 

C={.^jJ^so-,Gis)) \ (2.4) 

Assume n normal cells are present at time t = 0. Let (i„) be a sequence of instants, tending to 
infinity as n tends to infinity. At large time t„, a proportion e of the clones stemming from the 
n initial cells will have died out. A proportion 1 — e grow exponentially with rate v. So the final 
number of normal cells will be asymptotically equivalent to n{l — e)Ce''*". 

Consider now mutations. Let (p„) be a sequence of mutation probabilities, tending to as n 
tends to infinity. Since (p„) tends to zero, mutations have an asymptotically null effect on the 
growth rate of the population. Indeed the mean offspring number tends to 2(1 — g), the growth 
rate tends to the unique solution i> of the equation: 



r + co 

2(1 -g) / e-^'dGis) = I 
Jo 



and the proportionality constant tends to: 



C=\v 



/ .e--MG(.) 
1 - 2g Jo 



Moreover, since the number of divisions occurring in dying clones remains bounded, they can be 
neglected, and it can be considered that mutants observed at time i„ only come from divisions 
in surviving mutant clones. Their number is asymptotically equal to the final number of normal 
cells, i.e. n{l — e)Ce^*". Assume now that: 

lim p„n(l - e)Ce''*" = a , (2.5) 

where a is some fixed positive real. The expected number of mutations tends to a, and since 
mutations are supposed to occur randomly, their number asymptotically follows the Poisson 
distribution with parameter a, by the law of small numbers. Notice that the interpretation of a 
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as the product of the mutation probabiUty by the final number of cells holds whether cell deaths 
are considered or not; thus estimating a permits to estimate the mutation probability p, dividing 
the estimate of a by the final number of cells, exactly as in classical fluctuation analysis. 

Consider now the durations between random split times of non dying clones, and the final time 
tn'. we call them split lags. Theorem 2.1 p. 669 of Kuczek (1982) states the almost sure convergence 
of the empirical distribution of split lags, to the distribution function of the exponential with 
parameter i/. From section 3 of that same article, it follows that the developing times of a fixed 
number k of mutant clones converge in distribution to the product of k independent copies of the 
exponential distribution with parameter v. 

Let us now turn to mutant clones, i.e. populations of mutant cells stemming from a single 
initial mutant cell. Recall that the generation times of mutants are assumed to be exponentially 
distributed with rate fi*. The number of mutants at time s is a linear growth birth-and-dcath 
process. The rates are proportional to the number k of cells in the population, the death rate 
being fi*Sk and the birth rate being /i*(l — S)k. We shall assume also that mutant clones may 
survive with positive propability, which occurs only if 5 < 1/2. The exponential growth rate of 
mutant clones is the difference between birth and death individual rates: 

= -S)- fi*S = 25) . 

The distribution at time s of the number of mutant cells, stemming from a single mutant cell 
at time is explicitly known, and characterized by the following PGF: Athreya and Ney (1972, 
p. 109). 

^ ' ^ " (1 - S){1 ~z) + e-M*((l - S)z - S) • ^ ' 

Let us summarize the 3 main arguments: 

Al: the number of mutations converges in distribution to the Poisson distribution with param- 
eter a: 
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A2: the joint distribution of the developing times of a fixed number k of mutant clones converges 
in distribution to the product of k independent copies of the exponential distribution with 
parameter v; 

A3: the size at time s of a mutant clone has distribution with PGF F{z, s). 

From A2, the size of any mutant clone is an exponential mixture, the PGF of which can be 
expressed using (2.6) as: 



Changing e^'"* into v yields the expression (1.2) of hp^g. The distribution with PGF hp s can be 
seen as a two-parameter extension of the Yule distribution with parameter p: it will be denoted by 
YD{p, 5) (for "Yule with deaths" ). From Al, the total number of mutants is the sum of a random 
number of sizes of independent random clones, each with YD{p, 5) distribution: the resulting 
distribution is a compound Poisson with parameter a and base YD(p, (5), hence the expression 
(1.1) of the PGF ga^p^s of the LDD(a,p,(5). 

3. Probability calculations 

Computation and simulation algorithms for the YT){p,d) and the LiDD{a, p,d) distributions are 
described in this section. The probabilities of the YD{p, 6) and LDD(a, p, S) will be denoted by 
{pk)keN and (gfc)fceN respectively. 



We begin with a probabilistic interpretation of the distribution at time s of mutant clones, the 
PGF F{z, s) of which is given by (2.6). Let us rewrite F{z, s) as: 





and 




k=0 



F{z,s) 



(5(1 - e-^") - z{S - 0-^^(1 - S)) 



(1 - (5 - ^e-^"^) - z((l - S){1 - c-^'^) ' 
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An easy series expansion yields the corresponding probabilities: 

+00 

fc=o 

with 



Po(s) = 



1-6- (5c-^^ 
and for ^ 1, 

p,{s) = {l-p,{s))n{s){l - n{s))'^-\ with 7r(.) = ^ _ ^ ■ (3.7) 

In other words, a random variable with PGF F(z, s) is a random mixture: either with probability 
Pa{s) or (with probability 1 — po{s)), a geometric random variable with parameter tt{s). The 
YD{p,d) is an exponential mixture of these distributions. Using again the change of variable 

/' /^V1 /^"'"'d., (3.8) 
Jq l-S -6v 

and for k ^ 1: 

p, = il- 6r\l 25f £ (,L'r'C)'^+i P-' ■ (3-9) 
The integral in (3.9) can be computed numerically up to rather large values of k. An equivalent 
for larger fc's can be calculated as follows. Rewrite (3.9) as: 



l-S 



/ — ^ f4 Pu'' du 

"'0 k i-s) 



The following equivalent is obtained. 



1-26^'-" 



P^,- ^"""M^-T Pr(p+1). (3.10) 

fc-i-oo \ i — / 

A well known algorithm expresses the q^s as a function of the p^'s (Embrechts and Hawkes, 
1982; Pakes, 1993). 

k 

go = e-"(i-P°\ and forfeit, qt ^ ^Y.'P^'i''-- ■ (^.11) 

1=1 
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The proof of (3.11) is easy: 



dz 



= a 



dz 



'9a,p,S 




) 



( 



Y^QkZ 



k=0 



+00 



fc 



) 



+ 00 




,fc-l 



fc=l 

The equivalent of qk is deduced from subcxponential theory: Embrechts and Hawkes (1982, The- 
orem 1). 



For any S, a heavy tail distribution with tail exponent p is obtained. 

Another consequence of the probabilistic interpretation is a simulation algorithm for the 
YD(p,(5) and the LDB{a,p,6). 

For the YD(p,(5): 

• simulate a random time s, according to the exponential distribution with parameter p; 

• compute po{s) and 7r(s); 

• make a random choice: 

— with probability po{s), output 0, 

— with probability 1 — po{s), output a geometric random number with parameter 7r(s). 
For the LDD(a,p, S): 

• simulate a random integer n according to the Poisson distribution with parameter a; 

• simulate a sample of size n of the YD(p, 5), 




(3.12) 



Fluctuation analysis with cell deaths 



11 



• output the sum of that sample. 

These two algorithms have been encoded in R, and the simulation functions are included in the 
script available online. 

4. Parameter estimation 

This section addresses the problem of parameter estimation. The main difficulty comes from the 
fact that two LDD distributions may be quite close for rather different sets of parameters; this 
makes the model hardly identifiable in practice. In order to evaluate the actual identifiability, 
we proceed as follows. Let ao and po be two given positive values, and let {qa,qi) be the first 
two probabilities of the LDD(q!o, po, 0). For any value S < 0.5, there exists a couple (a^, ps) such 
that the first two probabilities of the LiDD{as, ps,S) coincide with ((Zoi?!)- It turns out that the 
whole distribution LDD(a5, ps, S) is very close to the LDD(ao, po, 0). Let dist(J) be the maximal 
distance between the two cumulative distribution functions. Table 1 gives the values of as, ps, 
and dist((5) for = po = 1 a-nd 5 from to 0.3. Of course, dist((5) depends on aQ, pq and 
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0.03 


0.06 


0.09 


0.12 


0.15 


0.18 


0.21 


0.24 


0.27 


0.30 


as 


1 


1.02 


1.03 


1.05 


1.08 


1.10 


1.13 


1.16 


1.20 


1.25 


1.30 


ps 


1 


1.01 


1.02 


1.04 


1.05 


1.07 


1.09 


1.12 


1.15 


1.19 


1.24 


103dist(,5) 





0.64 


1.35 


2.12 


2.98 


3.94 


5.02 


6.24 


7.63 


9.24 


11.11 



Table 1. Parameters of LDD distributions that coincide with LDD (1,1,0) on and 1, with maximal 
distance between cumulative distribution functions, multiplied by 10^. 

S: it increases with ao and S, it decreases with po; but its typical order of magnitude is 10^^. 
As a consequence, there is no hope to distinguish between LDD distributions on a sample of 
a few hundred data, which is the usual size in fluctuation analysis experiments. However, the 
observed number of mutants increases with a (the expected number of mutations), and so does 
the identifiability of the model. Here are the values of dist((5) for 5 = 0.1, pa = 1, and ao from 10 
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to 50. 





10 


20 


30 


40 


50 


103dist(0.1) 


14.30 


21.91 


27.52 


32.06 


35.91 



The fact that more precise estimates should be obtained for large values of a rules out in our 
view the maximum likelihood method, as already been explained in Hamon and Ycart (2012); it 
is the main argument supporting Generating Function (GF) estimators, that allow a rescaling 
of the sample. The estimation method proposed below extends the GF estimators introduced in 
Hamon and Ycart (2012). 

Recall the PGF of the LDD(a, p, 6): 

9a,pA^) = exp(-Q!(l - hp^s{z))) . 

For ^ 5 < 0.5 and ^ z < 1, we shall denote by (5* and the following quantities. 

5 , z — 5^ 

and z. 



^ \~5 ^ 1-z 

The PGF hp^s and its derivatives with respect of p and 5, denoted by ft,^''](z) and ft,^''](z) respec- 
tively, are repeatedly needed in numerical procedures, so numerically stable expressions must be 
derived. Here are the expressions that have been implemented in our R functions. 

hps{z)^5.+z,{l-5,) f -P^dv. (4.13) 

' 1 + z^v 







h^^]{z)^^-^^^z.{l-5.) f ^—[l + plog{v))dv. (4.14) 
op Jq 1 + 



1-z Jo il + z*v)^ J 
We use the method of moments, such as stated by Remillard and Theodorescu (2000) in a 
similar context. Let < zi < Z2 < Z3 < 1 be three different values, considered as fixed. Let 
5ii32,ff3 be their respective images by ga,p,s- Denote by G = G{a,p,5) the 3-dimensional vector 
(ffii 32, 53)- The mapping {a, p, S) 1 — > G is locally one-to-one, and its inverse can be used to derive 
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an estimate of (a, p, 6) from the natural estimate of G. Let (X,i)„^i be a sequence of independent 
identically distributed random variables, each with PGF ga,p,s- Define the empirical probability 
generating function (EPGF) gn{z) as: 



9n(z) = -V". 

n ^-^ 

i=i 

For i,j = 1, 2, 3, the expectations and covariances of gn{zi) and gn{zj) are easily expressed: 

^3n{Zi)\ = 9a,p,s{Zi) , 

and 

COv[gniZi),gn{Zj)] = c{Zi, Zj) = ga,p,siZiZj) - ga,p,5{Zi)ga,p,s{Zj) . 

Consider the 3-dimensional vector 

G= {gn{zi),gn{z2),gn{z3)) ■ 

Its coordinates are empirical means of independent, identically distributed, bounded random 
variables: hence it is a strongly consistent estimator of G. By the Central Limit Theorem, ^/n(G — 
G) converges in distribution to the trivariate centered normal distribution, with covariance matrix 
C — {c{zi, Zj))ij=i^2.3- Rcmillard and Theodorescu (2000, Proposition 3.1) give a stronger result, 
stating the functional convergence of gn{z) to a Gaussian process. 

The Jacobian matrix of G as a function of (a, p, S) is the following. 



J = 



dga 




dga. 


,P,«(2i) 


dg^ 


,p,i(^i) 




da 




dp 




dS 


dga 




dgc 


,P.«(Z2) 


dg. 


,P,i(^2) 








dp 




dS 


dga 


,P,<5(^3) 


dgc 






, P. .5(^3) 




da 




dp 




dS 



gi{hp,s{zi)-l) 5ia/i|, j-(zi) 

g2{hp,s{z2) - 1) g2ah''pl(z2) g2ah^pliz2) 
\ gsihp^sizs) - 1) gsah'flizs) gsahflizs) 

Admitting that J is non singular, denote by (/) the inverse of the mapping (a, p, S) 1 — > G. 
Then 0(G) is a consistent estimator of (a, p, (5). By Slutsky's theorem, such as formulated by 
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(Remillard and Theodorescu, 2000, Theorem 3.4), ^Jn{(f){G) ~ (a, p, J)) converges in distribution 
to the trivariate centered normal distribution with covariance matrix (J^^)* C J^^ . From there, 
confidence intervals and p- values of hypothesis testing can be obtained by standard procedures 
(see e.g. Anderson (2003)). As was explained in Hamon and Ycart (2012), the main advantage of 
GF estimators is to allow a rcscaling of the sample, which makes the method applicable to large 
values of a.. The idea is to replace z by z^^^ in the definition of gn{z): 



Wc propose to choose for b the q quantile of the sample, where q was experimentally set to 0.1. 



at the beginning of this section, it is numerically unstable, and can only be applied to very large 
samples, beyond the size of those usually collected in fluctuation analysis experiments. Thus we 
have been led to propose other estimators, that will now be described. 

We first assume that S is known. Observe that for z = (5,, = and hp^si^*) ~ S*' hp^s has 
a fixed point at (5,, independently of p. Therefore gn{S*) converges to ga,p,s{^*) — exp(a((5» — 1)). 
Hence log(g((5,))/ ((5, — 1) is a consistent estimator of a, that wc shall call the fixed point estimator. 
It does not depend on p. For (5 = 0, the fixed point estimator is — log(5(0)), and g{0) is the 
proportion of zeros in the sample. Thus the fixed point estimator extends the so called po"mcthod, 
initially proposed by Luria and Delbriick (1943) (sec Foster (2006) for a review on estimation 
methods for the LD(a, p)). The po-niethod, even though it gives acceptable results for low values 
of a, cannot be applied for large a's: the same can be said of the fixed point estimator. 

The best results were obtained for the GF estimators that were developed in Hamon and Ycart 
(2012) for the LD(a, p) case. We bricfiy recall their definition below. 




The estimator 4>{G) is theoretically consistent. However, for intrinsic reasons that were explained 
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Consider the following ratio. 

Jzi,Z2 \P) — , , -I • 

hp,s(Z2) - 1 

The function that maps p onto y ~ fzi,z2{p) is continuous and strictly monotone, hence one-to- 
one. Therefore the inverse, that maps y onto p = fz^^z^iy), is well defined. For < zi < Z2 < 1, 
let y,i(2i,Z2) denote the following log-ratio. 

. / ^ log(g»(zi)) 

yn{zi,Z2) = ,^ , . 

log(gr,(22)) 

An estimator of p is obtained by: 
then an estimator of a by: 

. /cx log(5„(z3)) 
a„(d) - 



^P„(2i,z2)(^3) - 1 ' 

The asymptotic covariance matrix of /5„) given in Proposition 4.1 of Hamon and Ycart (2012) 
is still valid here, replacing hp and its derivative in p by hp s and h^p^g. Scaling by a quantile of 
the sample as was explained above can also be applied. 

If we assume now that p is known and S unknown, the estimators described above are easily 
adapted, by exchanging the roles of p and S, and replacing /i^''] by h^p]- New GF estimators 
and Snip) s-i'^ obtained. 

In practice, neither S nor p can be supposed to be known. For a given value of 5, consider 
the estimators a{6) and p{6) described above. The distributions LDD{a{S), p{S), 6) from different 
values of 6 are not far from each other. To distinguish between them, we propose to use as an 
estimator of 6 the value 6 that minimizes the distance between the theoretical PGF and the 
EPGF of the sample (up to possible rescaling). We shall denote by B = {a{S) , p{5) , 6) this new 
estimator. Unlike (l){G) (which is numerically unstable for small samples), B can be calculated 
on samples of any size. It will be shown in the next section that when both can be calculated, B 
has a better Mean Squared Error (MSE) than 0(G'). 
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5. Experimental results 

Using extensively the simulation procedure described in section 3, we have conducted different 
simulation experiments in order to assess the qualities of the estimators proposed in section 4. 
We have also used the most comprehensive set of data available so far (1102 values), that of 
Boe and others (1994). Our main conclusions are reported in this section. 

The reason why the GF estimator 0(G) cannot be calculated for small samples was explained 
in the previous section. The question arises to compare it, on large samples and for large values 
of a, to the GF estimator B obtained by estimating first a and p on different values of 6, and 
then selecting the set of parameters that minimizes the distance between PGF's. The second one 
consistently gives better results. Here are for instance the MSEs on the estimation of the three 
parameters, on 1000 simulated samples of size 10"' of the LDD(10, 1,0.1). 





a p 6 


B 


0.551 0.019 0.111 
0.481 0.016 0.079 



Notice that both estimators perform quite poorly on the estimation of 6: the MSB is comparable 
to the true value. As explained in the previous section, this must be blamed on the intrinsic lack 
of identifiability of the model, rather than the estimators. 

We then tried to evaluate the quality of different estimators of a, which is the parameter of 
interest in fluctuation analysis. Four estimators were tried on 1000 samples of size 1000 of the 
LDD(a, p, S), computing for each estimator the MSB on a. 

• GBd: the estimate a{S) obtained using the true value of 6; 

• GBr: the estimate a{p) obtained using the true value of p; 

• GFO: the first coordinate d of i? (no prior information); 

• BP: the fixed point estimator (^^gjjjg ^j^q |;j-^g value of S). 
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{a,p, S) 


GFd 


GFr 


GFO 


FP 


(1,1,0.1) 


0.040 


0.087 


0.136 


0.041 


(1,1,0.05) 


0.039 


0.076 


0.144 


0.041 


(1,0.8,0.05) 


0.042 


0.105 


0.166 


0.044 


(10,1,0.1) 


0.272 


0.506 


1.000 


2.677 


(10,1,0.05 


0.894 


1.143 


1.983 


11.84 


(10,0.8,0.05) 


1.000 


1.393 


2.088 


13.87 



Table 2. Mean Squared Errors on 4 estimators of a from 1000 samples of size 1000 of the LDD(a,p, 5) 
for different values of the parameters. The first estimate uses the true value of 5, the second one the true 
value of p, the third one uses no prior information. The last one (fixed point) does not depend on p and 
uses the true value of 5. 

Table 2 shows the MSEs obtained for different sets of parameters. Not surprisingly, using the 
true value of (5 or p gives a better estimate of a; the information on 6 yields a better precision 
than the information on p. Both estimators GFd and FP use the information on 6, but the 
first one is better. For low values of a, FP performs reasonably well, as does the po-method for 
6 = 0. However for large values of a, FP is strongly biased. A value of p smaller than 1 implies 
a heavier tail, hence larger and more frequent outliers. It worsens the estimation of a, whatever 
the estimator. 

Apart from simulation experiments, we have tried estimating a, p, 5 on several samples of real 
data. The results obtained on the 1102 data from Boe and others (1994) are reported here. These 
data were ajusted on the LD(a,p) by Zheng (2005) using the maximum likelihood method: his 
estimates of a and p are 0.71 and 0.84 respectively. Table 3 shows the estimates of a and p 
obtained for values of 6 ranging from to 0.3. It also gives the distances between the empirical 
distribution and the estimated LDD, either in the sense of PGF's or in that of cumulative dis- 
tribution functions. The fit is quite good, whatever the value of S. The value 5 = 0.06 gives the 
best fit in the sense of PGF's. Even though this value is coherent with death probability esti- 
mates reported by Fontaine and others (2008), it cannot be considered as reliable. Indeed, the 
95% confidence margin of error on S, deduced from the asymptotic covariance matrix, is ±0.13. 
This huge margin is coherent with what we have observed on simulated samples with analogous 
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parameters. 



s 


0.00 


0.03 


0.06 


0.09 


0.12 


0.15 


0.18 


0.2 


10.24 


0.27 


0.30 


a{S) 


0.71 


0.72 


0.73 


0.75 


0.77 


0.79 


0.81 


0.83 


0.87 


0.90 


0.95 


m 


0.82 


0.83 


0.84 


0.84 


0.85 


0.86 


0.87 


0.88 


0.90 


0.92 


0.94 


DG 


1.04 


0.87 


0.74 


0.79 


0.86 


0.94 


1.05 


1.19 


1.38 


1.66 


2.06 


DF 


6.48 


6.25 


6.29 


6.53 


6.79 


7.08 


7.41 


7.79 


8.23 


8.74 


9.35 



Table 3. Estimates of a and p for different values of 5 on the set of data of Boe and others (1994). 
On row 4, DG is the maximal distance between the EPGF function of the sample and the PGF of the 
LDD(q(5),/3((5),(5), muhiplied by 10^ On row 5, DF is the distance between cumulative distribution 
functions, also multiplied by 10"^. 

The main conclusion of our experimental study is that the death probability S cannot be reliably 
estimated on samples such that the product a x n is lower than 10^, which is far beyond current 
fluctuation analysis experiments. We remark that available estimates of S have orders of magni- 
tude of a few pereents: see Fontaine and others (2008). Table 1 for theoretical distances as well 
as Table 3 for actual data, permit to evaluate the influence of 6: it turns out that the effect of a 
small S on estimates on the estimates of a and p has the same order of magnitude as 5 itself. So 
neglecting the effect of cell deaths if no reliable estimate of their probability is available, seems 
legitimate. 

6. Conclusion 

A probabilistic model of fluctuation analysis, taking into account cell deaths, has been proposed. 
A new family of distributions LDD(a,p, (5), modeling asymptotic number of mutants has been 
derived. The three parameters are the expected number of mutations a (which is the parameter 
of interest in fluctuation analysis), the relative fitness of normal cells compared to mutants p, 
and the death probability of mutant cells 5. In the particular case 6 = 0, the classic Luria- 
Delbriick distribution is recovered. The extension of known mathematical results to the case 
(5 > is straightforward: explicit simulation and computation algorithms for probabilities have 
been described. The LDD(q;, p, S) has the same type of asymptotic behavior than the Luria- 
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Delbriick distributions: heavy tail with tail exponent p. Thus, the occurrence of "jackpots" (large 
counts of mutants) is a common feature. Modeling an observed sample of mutant counts by a 
LDD(a, p, S) poses the problem of estimating the three parameters simultaneously. If the death 
probability d is known, then a and p can be estimated using the generating function method, 
exactly as in the case (5 = 0. The larger the expected number of mutations a, the more precise 
the estimates. However, for samples of size smaller than 10^, all values of S lead to a good fit, 
and the different distributions so obtained can hardly be distinguished. Choosing for S the value 
giving the best fit yields a consistent estimator with optimal mean squared error, but it cannot 
be considered a reliable choice for small samples. Since the death probabilities that have been 
reported in practice are small, their influence on the estimates of a and p can be neglected 
as a first approximation, if no prior information on the actual value of d is available. A script 
containing the R functions for the statistical treatment of the LDD distributions has been made 
available online. 
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